DS 6010 | Summer 2021 | University of Virginia
At around 5:00 PM on January 12th, 2010, a magnitude 7.0 earthquake struck just west of Port-au-Prince, Haiti. The earthquake may have caused over 100,000 deaths, destroyed much of the island's infrastructure, and displaced over a million residents. In the immediate aftermath of the quake, many Haitians chose to shelter under tarps due to the risk of aftershocks collapsing buildings damaged by the earlier earthquake. Because critical transportation and communications infrastructure was damaged, and because of the country's geography, coordination of relief efforts proved difficult: displaced persons were hard to locate.
To help locate refugees, a team from the Rochester Institute of Technology surveyed the affected region using aerial photography. Given the scale of the disaster, a large land area needed to be covered, making manual identification of photographs containing blue tarps a very resource-intensive process. An algorithm was therefore needed that could quickly and accurately locate potential displaced persons so that humanitarian aid resources could be deployed to those areas.
This project will focus on determining the appropriate algorithm for this application.
The following R packages were utilized in this project:
library(tidyverse)
library(skimr)
library(GGally)
library(caret)
library(broom)
library(yardstick)
library(class)
library(plotly)
library(plotROC)
The data was supplied as a .csv file. The read_csv function (from readr, loaded as part of the tidyverse) was used to read the data into a data frame object called raw_data. This data frame serves as the record of the as-received data.
raw_data <- read_csv('HaitiPixels.csv',
col_types = list(col_factor(),
col_double(),
col_double(),
col_double()))
Next, the skimr package was used to summarize the imported data:
skim(raw_data)
| Name | raw_data |
| Number of rows | 63241 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Class | 0 | 1 | FALSE | 5 | Veg: 26006, Soi: 20566, Roo: 9903, Var: 4744 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Red | 0 | 1 | 162.98 | 78.88 | 48 | 80 | 163 | 255 | 255 | ▆▂▂▂▇ |
| Green | 0 | 1 | 153.66 | 72.64 | 48 | 78 | 148 | 226 | 255 | ▇▃▂▃▇ |
| Blue | 0 | 1 | 125.14 | 61.39 | 44 | 63 | 123 | 181 | 255 | ▇▂▃▆▁ |
The skim indicates that the data frame contains four variables: the Red, Green, and Blue (RGB) color intensities of each pixel, each ranging from 0 to 255, and the classification of the pixel contents, Class, which is summarized in the table below:
knitr::kable(table(raw_data$Class), col.names = c('Class',
'Observations'))
| Class | Observations |
|---|---|
| Vegetation | 26006 |
| Soil | 20566 |
| Rooftop | 9903 |
| Various Non-Tarp | 4744 |
| Blue Tarp | 2022 |
There are five levels within the categorical variable Class. The target class, Blue Tarp, only makes up a small portion of the data contained in this dataset. This is not surprising, as the dataset contains pixel information for a large area of land, with only a small portion of the land occupied by refugees.
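The class imbalance can be quantified directly from the counts above. A minimal sketch (the counts are taken from the table; the variable names are illustrative):

```r
# Class counts from the table above
counts <- c(Vegetation = 26006, Soil = 20566, Rooftop = 9903,
            `Various Non-Tarp` = 4744, `Blue Tarp` = 2022)

# Proportion of each class in the dataset
props <- counts / sum(counts)
round(props[["Blue Tarp"]], 4)
#> [1] 0.032
```

Blue Tarp pixels make up only about 3.2% of the data, which is the prevalence that reappears in the confusion matrices below.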
Next, the ggpairs function from the GGally package was used to visualize data distributions and correlations:
ggpairs(raw_data)
There are several points of interest in the ggpairs plot. First, the color values are highly correlated with one another. Second, for the Blue Tarp class, the Blue value is higher on average than for the other classes. This can also be seen in the density curve for Blue values, which shows a large concentration of Blue values around 200.
After conducting the exploratory data analysis, a working copy of the raw data was saved into a new dataframe called df.
df = raw_data
During exploratory data analysis, it was observed that the variable of interest, Class, has five levels. Because the objective of this project is to predict whether the pixel color values for a given area indicate that refugees are sheltering there, this categorical variable was collapsed from five classes to two. The new classes, BT (Blue Tarp) and NBT (Not Blue Tarp), are the levels of the new response variable, collapse.Class. In addition, the seed 304 was set for repeatable random splits during cross validation.
set.seed(304)
df$collapse.Class = as.factor(ifelse(df$Class == "Blue Tarp", "BT", "NBT"))
df = subset(df, select = -Class)
df$collapse.Class = relevel(df$collapse.Class, ref = "NBT")
df$collapse.Class <- factor(df$collapse.Class, levels=rev(levels(df$collapse.Class)))
plot_preds = data.frame(df$collapse.Class)
After collapsing the response variable, the contrasts function was used to check the leveling of collapse.Class:
contrasts(df$collapse.Class)
#> NBT
#> BT 0
#> NBT 1
A summary of the new variable collapse.Class can be found below:
knitr::kable(table(df$collapse.Class), col.names = c('Class',
'Observations'))
| Class | Observations |
|---|---|
| BT | 2022 |
| NBT | 61219 |
It is clear that the count of pixels with the positive response, BT, is sparse compared to the overall number of pixels observed. After the reduction in the number of levels of Class, the distribution of the data in the 3D parameter space can be visualized clearly. The plotly package was used to generate an interactive 3D scatter plot of the observations.
fig <- plot_ly(df, x = ~Red, y = ~Green, z = ~Blue, color = ~collapse.Class, colors = c('#0C4B8E','#BF382A'))
fig <- fig %>% add_markers()
fig <- fig %>% layout(title = 'Distribution of Observations in 3D Parameter Space',
scene = list(xaxis = list(title = 'Red'),
yaxis = list(title = 'Green'),
zaxis = list(title = 'Blue')))
fig
The scatter plot indicates a reasonably clear distinction between the classes BT and NBT, with BT pixels having higher Blue values than NBT pixels. However, it is unclear what form the decision boundary should take to create an optimal classification model for this data. Therefore, several classification models were considered:

- Logistic regression
- Linear discriminant analysis (LDA)
- Quadratic discriminant analysis (QDA)
- K-nearest neighbors (KNN)
- Penalized logistic regression (ElasticNet)
To begin the model selection, the trainControl function from the caret package was used to set up a 10-fold cross validation for evaluating each model. Additional parameters were set to save the calculated class probabilities for each observation, to save the predictions from the cross validation, and to allow parallel processing.
ctrl = trainControl(method = 'cv',
number = 10,
classProbs = TRUE,
savePredictions = TRUE,
allowParallel = TRUE)
The first model examined was a logistic regression model. The train function from the caret package was used to fit a model predicting collapse.Class from the predictors Red, Green, and Blue. The model was then evaluated using the 10-fold cross validation set up with trainControl.
model.logistic = train(form = collapse.Class~.,
data = df,
method = 'glm',
trControl = ctrl,
family = 'binomial',
trace = FALSE)
model.logistic
#> Generalized Linear Model
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'BT', 'NBT'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56917, 56917, 56917, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9953037 0.9210554
model.logistic$finalModel %>% broom::tidy()
#> # A tibble: 4 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -0.210 0.185 -1.14 2.56e- 1
#> 2 Red 0.260 0.0126 20.6 1.42e- 94
#> 3 Green 0.218 0.0133 16.4 1.46e- 60
#> 4 Blue -0.472 0.0155 -30.5 1.88e-204
The cross-validation output indicates that this model has very high accuracy (99.53%). The model coefficients also indicate that all three predictors (Red, Green, and Blue) are highly significant in predicting whether a blue tarp is present (no predictor p-value exceeds \(10^{-50}\)). However, accuracy is not the only important characteristic for this application: the model must correctly identify areas that have blue tarps, and it must not waste aid resources by labeling areas without blue tarps as containing them. To investigate the logistic regression model's performance further, a confusion matrix is needed:
confusionMatrix(model.logistic)
#> Cross-Validated (10 fold) Confusion Matrix
#>
#> (entries are percentual average cell counts across resamples)
#>
#> Reference
#> Prediction BT NBT
#> BT 2.8 0.1
#> NBT 0.4 96.7
#>
#> Accuracy (average) : 0.9953
The cross-validation confusion matrix indicates that the model has a sensitivity (the percentage of blue tarp areas correctly identified out of all blue tarp areas) of 87.5% and a specificity (the percentage of non-blue-tarp areas correctly identified out of all non-blue-tarp areas) of 99.9%. This confusion matrix uses a cutoff value of 0.5, which may not be optimal for this model, since identifying blue tarp areas is more valuable than correctly identifying non-blue-tarp areas. To begin investigating the ideal cutoff value, the model's Receiver Operating Characteristic (ROC) curve is generated:
r = ggplot(model.logistic$pred, aes(m = BT, d = obs)) +
geom_roc(labels = FALSE, increasing = FALSE) +
labs(title = 'ROC Curve for Logistic Regression Model') +
xlab('False Positive Rate') +
ylab('True Positive Rate')
r + annotate("text", x = .8, y = .15, label = paste("AUC =", round(calc_auc(r)$AUC, 4)))
The ROC curve's proximity to the top left of the plot and its associated area under the curve (AUC) indicate that the logistic regression model performs very well at predicting whether a given pixel represents a blue tarp. To determine the optimal cutoff value for the model, an interactive plot was created to explore the model's performance metrics at different cutoff values:
sens = c()
spec = c()
acc = c()
fpr = c()
cutoff = seq(0, 1, by=0.01)
for (i in 1:length(cutoff)){
newpred = factor(ifelse(model.logistic$pred$BT > cutoff[i], "BT", "NBT"), levels = c("BT", "NBT"))
sens[i] = sens_vec(model.logistic$pred$obs, newpred)
spec[i] = spec_vec(model.logistic$pred$obs, newpred)
acc[i] = accuracy_vec(model.logistic$pred$obs, newpred)
fpr[i] = 1 - spec[i]
}
cutoff_plot_data = data.frame(Cutoff = cutoff, Sensitivity = sens, Specificity = spec, Accuracy = acc, FPR = fpr)
p = ggplot(reshape2::melt(cutoff_plot_data, id.var='Cutoff')) +
geom_line(aes(x = Cutoff, y = value, color = variable)) +
scale_x_continuous(limits = c(0,1), breaks = seq(0,1,0.1)) +
labs(title = 'Logistic Regression Model Performance by Cutoff Value') +
xlab('Probability Cutoff Value')
ggplotly(p, dynamicTicks = TRUE) %>% rangeslider() %>% layout(hovermode = 'x')
After investigating the interactive plot, a cutoff value of 0.1 was selected for the model. The associated confusion matrix for this value is shown below:
newpred = factor(ifelse(model.logistic$pred$BT > 0.1, "BT", "NBT"), levels = c("BT", "NBT"))
plot_preds = cbind(plot_preds,newpred)
logistic.cm = confusionMatrix(newpred, model.logistic$pred$obs)
logistic.cm
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction BT NBT
#> BT 1952 510
#> NBT 70 60709
#>
#> Accuracy : 0.9908
#> 95% CI : (0.9901, 0.9916)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.8659
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.96538
#> Specificity : 0.99167
#> Pos Pred Value : 0.79285
#> Neg Pred Value : 0.99885
#> Prevalence : 0.03197
#> Detection Rate : 0.03087
#> Detection Prevalence : 0.03893
#> Balanced Accuracy : 0.97853
#>
#> 'Positive' Class : BT
#>
Using a cutoff of 0.1 improves the model's sensitivity from 87.5% to 96.5% while only reducing its specificity from 99.9% to 99.2%, a large improvement in the model's utility to aid workers.
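The reported metrics follow directly from the confusion matrix cell counts. A quick arithmetic check, using the counts printed above (BT is the positive class; the variable names are illustrative):

```r
# Confusion matrix cells at the 0.1 cutoff (from the output above)
TP <- 1952; FP <- 510   # predicted BT
FN <- 70;   TN <- 60709 # predicted NBT

sensitivity <- TP / (TP + FN)  # 1952 / 2022
specificity <- TN / (TN + FP)  # 60709 / 61219
precision   <- TP / (TP + FP)  # 1952 / 2462

round(c(sensitivity, specificity, precision), 4)
#> [1] 0.9654 0.9917 0.7929
```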
A summary of the logistic regression model with the selected cutoff value can be found below:
logistic.results = data.frame(Model = 'Logistic Regression', Tuning = 'NA', AUROC = round(calc_auc(r)$AUC, 4), Threshold = 0.1, Accuracy = round(logistic.cm$overall[[1]],4), TPR = round(logistic.cm$byClass[[1]],4), FNR = round(1-logistic.cm$byClass[[2]],4), Precision = round(logistic.cm$byClass[[5]],4))
knitr::kable(logistic.results)
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FNR | Precision |
|---|---|---|---|---|---|---|---|
| Logistic Regression | NA | 0.9985 | 0.1 | 0.9908 | 0.9654 | 0.0083 | 0.7929 |
The next model explored was a linear discriminant analysis (LDA) model. LDA may improve on the logistic regression model by generating a linear decision boundary that maximizes the separation between the two classes. The model was trained using the train function from the caret package and evaluated using a 10-fold cross validation:
model.lda = train(collapse.Class~.,
data = df,
method = 'lda',
trControl = ctrl,
trace = FALSE)
model.lda
#> Linear Discriminant Analysis
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'BT', 'NBT'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56918, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9839661 0.7533183
model.lda$finalModel
#> Call:
#> lda(x, grouping = y, trace = FALSE)
#>
#> Prior probabilities of groups:
#> BT NBT
#> 0.03197293 0.96802707
#>
#> Group means:
#> Red Green Blue
#> BT 169.6627 186.4149 205.0371
#> NBT 162.7604 152.5808 122.4993
#>
#> Coefficients of linear discriminants:
#> LD1
#> Red 0.02896984
#> Green 0.02328544
#> Blue -0.06923974
Again, the initial model fit shows promising accuracy. Further investigation using the model’s confusion matrix and ROC curve is shown below:
confusionMatrix(model.lda)
#> Cross-Validated (10 fold) Confusion Matrix
#>
#> (entries are percentual average cell counts across resamples)
#>
#> Reference
#> Prediction BT NBT
#> BT 2.6 1.0
#> NBT 0.6 95.8
#>
#> Accuracy (average) : 0.984
r = ggplot(model.lda$pred, aes(m = BT, d = obs)) +
geom_roc(labels = FALSE, increasing = FALSE) +
labs(title = 'ROC Curve for Linear Discriminant Analysis (LDA) Model') +
xlab('False Positive Rate') +
ylab('True Positive Rate')
auc = round(calc_auc(r)$AUC, 4)
r + annotate("text", x = .8, y = .15, label = paste("AUC =", round(calc_auc(r)$AUC, 4)))
The confusion matrix for the LDA model indicates an accuracy of 98.40%. The ROC curve's proximity to the upper left of the plot area and its associated \(AUC = 0.9888\) indicate that this model categorizes blue tarp vs. non-blue-tarp pixels very well overall. However, this is not the ideal measure of model performance for this application: the 0.5 probability threshold used when generating the confusion matrix yields a True Positive Rate of only \(\dfrac{2.6}{2.6 + 0.6} = 0.8125\). The low TPR at the 0.5 cutoff is most likely due to the relative sparsity of BT data points compared to NBT data points, as seen in the EDA section. A low TPR is not ideal, as the model may classify blue tarp areas as non-blue-tarp areas, slowing the deployment of resources to endangered people. Choosing another cutoff value may improve the model's performance. The performance metrics for the LDA model at different cutoff values are shown below:
sens = c()
spec = c()
acc = c()
fpr = c()
cutoff = seq(0, 1, by=0.01)
for (i in 1:length(cutoff)){
newpred = factor(ifelse(model.lda$pred$BT > cutoff[i], "BT", "NBT"), levels = c("BT", "NBT"))
sens[i] = sens_vec(model.lda$pred$obs, newpred)
spec[i] = spec_vec(model.lda$pred$obs, newpred)
acc[i] = accuracy_vec(model.lda$pred$obs, newpred)
fpr[i] = 1 - spec[i]
}
cutoff_plot_data = data.frame(Cutoff = cutoff, Sensitivity = sens, Specificity = spec, Accuracy = acc, FPR = fpr)
p = ggplot(reshape2::melt(cutoff_plot_data, id.var='Cutoff')) +
geom_line(aes(x = Cutoff, y = value, color = variable)) +
scale_x_continuous(limits = c(0,1), breaks = seq(0,1,0.1)) +
labs(title = 'LDA Model Performance by Cutoff Value') +
xlab('Probability Cutoff Value')
ggplotly(p, dynamicTicks = TRUE) %>% rangeslider() %>% layout(hovermode = 'x')
The interactive plot above indicates that choosing a cutoff value of 0.05 improves the model's TPR by about 5 percentage points, to 86%. While not ideal, this improvement could result in the correct identification of many more temporary shelters to which relief aid can be brought. The confusion matrix for a cutoff of 0.05 is shown below:
newpred = factor(ifelse(model.lda$pred$BT > 0.05, "BT", "NBT"), levels = c("BT", "NBT"))
plot_preds = cbind(plot_preds,newpred)
lda.cm = confusionMatrix(newpred, model.lda$pred$obs)
lda.cm
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction BT NBT
#> BT 1739 911
#> NBT 283 60308
#>
#> Accuracy : 0.9811
#> 95% CI : (0.98, 0.9822)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.7348
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.86004
#> Specificity : 0.98512
#> Pos Pred Value : 0.65623
#> Neg Pred Value : 0.99533
#> Prevalence : 0.03197
#> Detection Rate : 0.02750
#> Detection Prevalence : 0.04190
#> Balanced Accuracy : 0.92258
#>
#> 'Positive' Class : BT
#>
A summary table for the LDA model performance is shown below:
lda.results = data.frame(Model = 'LDA', Tuning = 'NA', AUROC = round(calc_auc(r)$AUC, 4), Threshold = 0.05, Accuracy = round(lda.cm$overall[[1]],4), TPR = round(lda.cm$byClass[[1]],4), FNR = round(1-lda.cm$byClass[[2]],4), Precision = round(lda.cm$byClass[[5]],4))
knitr::kable(lda.results)
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FNR | Precision |
|---|---|---|---|---|---|---|---|
| LDA | NA | 0.9888 | 0.05 | 0.9811 | 0.86 | 0.0149 | 0.6562 |
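To see how the LDA coefficients separate the classes, the linear discriminant can be evaluated at the two group means printed in the model output. This sketch ignores the centering that lda applies internally, so the absolute scores are shifted, but the gap between the group scores is unaffected:

```r
# LD1 coefficients and group means copied from the model output above
ld1 <- c(Red = 0.02896984, Green = 0.02328544, Blue = -0.06923974)
mean_bt  <- c(Red = 169.6627, Green = 186.4149, Blue = 205.0371)
mean_nbt <- c(Red = 162.7604, Green = 152.5808, Blue = 122.4993)

# Discriminant score at each group mean (uncentered)
score_bt  <- sum(ld1 * mean_bt)   # the large negative Blue coefficient dominates
score_nbt <- sum(ld1 * mean_nbt)

score_bt < score_nbt
#> [1] TRUE
```

The BT group mean lands well below the NBT group mean on the discriminant, confirming that the large negative weight on Blue is what drives the separation.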
The next model explored was a quadratic discriminant analysis (QDA) model. Like LDA, QDA generates a decision boundary that maximizes the separation between the classes; however, QDA does not assume that the classes have equal variance, and therefore provides a more flexible fit with a quadratic decision boundary. This may be useful here, where the decision boundary may be non-linear. Again, the model was trained using the train function from the caret package.
model.qda = train(collapse.Class~.,
data = df,
method = 'qda',
trControl = ctrl,
trace = FALSE)
model.qda
#> Quadratic Discriminant Analysis
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'BT', 'NBT'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9945446 0.9045071
model.qda$finalModel
#> Call:
#> qda(x, grouping = y, trace = FALSE)
#>
#> Prior probabilities of groups:
#> BT NBT
#> 0.03197293 0.96802707
#>
#> Group means:
#> Red Green Blue
#> BT 169.6627 186.4149 205.0371
#> NBT 162.7604 152.5808 122.4993
The QDA results shown above are promising: the overall model accuracy has improved to 99.45%. A confusion matrix was generated for the 10-fold cross validation to visualize model performance at a cutoff value of 0.5:
confusionMatrix(model.qda)
#> Cross-Validated (10 fold) Confusion Matrix
#>
#> (entries are percentual average cell counts across resamples)
#>
#> Reference
#> Prediction BT NBT
#> BT 2.7 0.0
#> NBT 0.5 96.8
#>
#> Accuracy (average) : 0.9945
The confusion matrix indicates that the QDA model outperforms the LDA model and performs nearly identically to the logistic regression model. Next, a receiver operating characteristic (ROC) curve was generated to visualize model performance across all cutoff values:
r = ggplot(model.qda$pred, aes(m = BT, d = obs)) +
geom_roc(labels = FALSE, increasing = FALSE) +
labs(title = 'ROC Curve for Quadratic Discriminant Analysis (QDA) Model') +
xlab('False Positive Rate') +
ylab('True Positive Rate')
auc = round(calc_auc(r)$AUC, 4)
r + annotate("text", x = .8, y = .15, label = paste("AUC =", round(calc_auc(r)$AUC, 4)))
The performance of the QDA model is nearly identical to that of the logistic regression model, with an AUC of 0.9982. Next, the cutoff value was tuned to balance sensitivity and specificity. The interactive plot below was used to select an optimal cutoff value:
sens = c()
spec = c()
acc = c()
fpr = c()
cutoff = seq(0, 1, by=0.01)
for (i in 1:length(cutoff)){
newpred = factor(ifelse(model.qda$pred$BT > cutoff[i], "BT", "NBT"), levels = c("BT", "NBT"))
sens[i] = sens_vec(model.qda$pred$obs, newpred)
spec[i] = spec_vec(model.qda$pred$obs, newpred)
acc[i] = accuracy_vec(model.qda$pred$obs, newpred)
fpr[i] = 1 - spec[i]
}
cutoff_plot_data = data.frame(Cutoff = cutoff, Sensitivity = sens, Specificity = spec, Accuracy = acc, FPR = fpr)
p = ggplot(reshape2::melt(cutoff_plot_data, id.var='Cutoff')) +
geom_line(aes(x = Cutoff, y = value, color = variable)) +
scale_x_continuous(limits = c(0,1), breaks = seq(0,1,0.1)) +
labs(title = 'QDA Model Performance by Cutoff Value') +
xlab('Probability Cutoff Value')
ggplotly(p, dynamicTicks = TRUE) %>% rangeslider() %>% layout(hovermode = 'x')
The plot indicates that lowering the cutoff below 0.1 yields little additional improvement in model performance, so a confusion matrix was generated using a cutoff value of 0.1.
newpred = factor(ifelse(model.qda$pred$BT > 0.1, "BT", "NBT"), levels = c("BT", "NBT"))
plot_preds = cbind(plot_preds,newpred)
qda.cm = confusionMatrix(newpred, model.qda$pred$obs)
qda.cm
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction BT NBT
#> BT 1818 456
#> NBT 204 60763
#>
#> Accuracy : 0.9896
#> 95% CI : (0.9887, 0.9903)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.841
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.89911
#> Specificity : 0.99255
#> Pos Pred Value : 0.79947
#> Neg Pred Value : 0.99665
#> Prevalence : 0.03197
#> Detection Rate : 0.02875
#> Detection Prevalence : 0.03596
#> Balanced Accuracy : 0.94583
#>
#> 'Positive' Class : BT
#>
Using a cutoff of 0.1, the model's TPR improves from 84.4% to 90.0%. The chosen cutoff increases the sensitivity by nearly 6 percentage points while decreasing the overall model accuracy by less than a percentage point.
A summary table of the QDA model results can be found below:
qda.results = data.frame(Model = 'QDA', Tuning = 'NA', AUROC = round(calc_auc(r)$AUC, 4), Threshold = 0.1, Accuracy = round(qda.cm$overall[[1]],4), TPR = round(qda.cm$byClass[[1]],4), FNR = round(1-qda.cm$byClass[[2]],4), Precision = round(qda.cm$byClass[[5]],4))
knitr::kable(qda.results)
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FNR | Precision |
|---|---|---|---|---|---|---|---|
| QDA | NA | 0.9982 | 0.1 | 0.9896 | 0.8991 | 0.0074 | 0.7995 |
The next model tested was a K-Nearest Neighbors (KNN) model. While the parametric models have all proven quite accurate in predicting the presence of tarps, the non-parametric and highly flexible KNN model may yield better results.
The first step in selecting the KNN model is to determine the optimal value of \(k\): how many neighbors "vote" on each point to determine its class. To obtain an optimal value of \(k\), the train function from the caret package is used:
model.knn = train(collapse.Class~.,
data = df,
method = 'knn',
trControl = ctrl,
tuneGrid = expand.grid(k = seq(1, 21, by = 2)))
model.knn
#> k-Nearest Neighbors
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: 'BT', 'NBT'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 56917, 56918, 56916, 56917, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> k Accuracy Kappa
#> 1 0.9968849 0.9494116
#> 3 0.9972170 0.9550278
#> 5 0.9973119 0.9567646
#> 7 0.9972012 0.9550364
#> 9 0.9971538 0.9542424
#> 11 0.9971063 0.9534146
#> 13 0.9972012 0.9548647
#> 15 0.9971221 0.9536106
#> 17 0.9970747 0.9527917
#> 19 0.9970273 0.9520740
#> 21 0.9971063 0.9533568
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 5.
ggplot(model.knn) + theme_bw()
The plot and model summary above indicate that \(k = 5\) is optimal based on the 10-fold cross validation: performance increases up to this point and then falls off slightly as k continues to increase. The confusion matrix for \(k = 5\) is shown below:
knn.cm = confusionMatrix(model.knn$pred$pred,model.knn$pred$obs)
plot_preds = cbind(plot_preds,model.knn$pred$pred)
knn.cm
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction BT NBT
#> BT 21279 1035
#> NBT 963 672374
#>
#> Accuracy : 0.9971
#> 95% CI : (0.997, 0.9973)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9537
#>
#> Mcnemar's Test P-Value : 0.1122
#>
#> Sensitivity : 0.95670
#> Specificity : 0.99846
#> Pos Pred Value : 0.95362
#> Neg Pred Value : 0.99857
#> Prevalence : 0.03197
#> Detection Rate : 0.03059
#> Detection Prevalence : 0.03208
#> Balanced Accuracy : 0.97758
#>
#> 'Positive' Class : BT
#>
This is the best-performing model explored so far, with an accuracy of 0.9971 and a TPR of 0.9567. Since the class of each prediction is determined by its k nearest neighbors' "votes", there is no cutoff value to select, and the model can be interpreted as-is. A table of the KNN model's results can be found below:
knn.results = data.frame(Model = 'KNN', Tuning = 'k=5', AUROC = 'NA', Threshold = 'NA', Accuracy = round(knn.cm$overall[[1]],4), TPR = round(knn.cm$byClass[[1]],4), FNR = round(1-knn.cm$byClass[[2]],4), Precision = round(knn.cm$byClass[[5]],4))
knitr::kable(knn.results)
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FNR | Precision |
|---|---|---|---|---|---|---|---|
| KNN | k=5 | NA | NA | 0.9971 | 0.9567 | 0.0015 | 0.9536 |
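The voting mechanism behind KNN can be illustrated with the class package (loaded earlier) on a toy dataset; the points and labels below are made up for illustration only:

```r
library(class)

# Toy training data: two well-separated clusters in RGB space
train <- rbind(c(200, 200, 240), c(210, 190, 235), c(190, 205, 230),  # tarp-like (high Blue)
               c(120, 100, 60),  c(130, 110, 70),  c(110, 95, 65))    # soil-like (low Blue)
labels <- factor(c("BT", "BT", "BT", "NBT", "NBT", "NBT"))

# A new pixel with a high Blue value: its 3 nearest neighbors all vote BT
knn(train, test = c(205, 195, 238), cl = labels, k = 3)
#> [1] BT
#> Levels: BT NBT
```

With k = 5 on the full dataset, each pixel is classified the same way: by majority vote among its five nearest neighbors in RGB space.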
The final model examined is a penalized logistic regression using ElasticNet. This model is similar to Least Absolute Shrinkage and Selection Operator (LASSO) and ridge regression in that it includes a penalty term. ElasticNet combines properties of both: it allows some coefficients to shrink and others to go exactly to 0. This property may allow the model to outperform the others already examined.
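For reference, the penalized objective that glmnet minimizes can be written (using the standard glmnet parameterization) as:

\[
\min_{\beta_0, \beta} \; -\frac{1}{n}\sum_{i=1}^{n} \ell\!\left(y_i, \beta_0 + x_i^{T}\beta\right) + \lambda \left[ \frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1 \right]
\]

where \(\ell\) is the binomial log-likelihood contribution of observation \(i\), \(\alpha = 0\) gives ridge regression, \(\alpha = 1\) gives the LASSO, and intermediate values mix the two penalties.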
To begin, the tuning parameters, \(\alpha\) and \(\lambda\) must be selected. The method train from the caret package was utilized to select these parameters:
model.elnet = train(
collapse.Class ~ .,
data = df,
method = "glmnet",
family = 'binomial',
preProcess = c('center','scale'),
trControl = ctrl)
alpha = model.elnet$bestTune$alpha
lambda = model.elnet$bestTune$lambda
The train function examined 3 values each for \(\alpha\) and \(\lambda\), and selected \(\alpha = 1\) and \(\lambda = 8.32\times 10^{-5}\); with \(\alpha = 1\), the ElasticNet regression effectively reduces to a LASSO regression. The coefficients of the selected model are shown below:
coef(model.elnet$finalModel, model.elnet$finalModel$lambdaOpt)
#> 4 x 1 sparse Matrix of class "dgCMatrix"
#> 1
#> (Intercept) 12.96764
#> Red 16.33537
#> Green 10.57268
#> Blue -21.57395
A visualization of the model performance at each value of \(\alpha\) and \(\lambda\) is shown below:
ggplot(model.elnet) + geom_line() + geom_point() + labs(title = 'ElasticNet Model Performance by Tuning Parameter Choice')
The graphic indicates that the best value of the mixing percentage \(\alpha\) is 1 and the best value of the regularization parameter \(\lambda\) is \(8.32\times 10^{-5}\). This small value of \(\lambda\) indicates that the selected LASSO model very closely approximates an ordinary logistic regression model, as the penalty term carries little weight. The light penalty also suggests that all three variables (Red, Green, and Blue) provide utility in classifying BT vs. NBT pixels.
The model parameters above create a model with the following confusion matrix for a cutoff of 0.5:
confusionMatrix(model.elnet)
#> Cross-Validated (10 fold) Confusion Matrix
#>
#> (entries are percentual average cell counts across resamples)
#>
#> Reference
#> Prediction BT NBT
#> BT 2.8 0.1
#> NBT 0.4 96.7
#>
#> Accuracy (average) : 0.9952
The ElasticNet model performs very well, matching the other parametric models with an accuracy of 0.9952 and a TPR of 0.875 at the 0.5 cutoff. The ROC curve for the ElasticNet model is shown below:
r = ggplot(model.elnet$pred, aes(m = BT, d = obs)) +
geom_roc(labels = FALSE, increasing = FALSE) +
labs(title = 'Elasticnet Model ROC Curve') +
xlab('False Positive Rate') +
ylab('True Positive Rate')
auc = round(calc_auc(r)$AUC, 4)
r + annotate("text", x = .8, y = .15, label = paste("AUC =", round(calc_auc(r)$AUC, 4)))
The ROC curve has an \(AUC = 0.9919\), indicating a very strong model. Finally, the interactive plot below was used to improve the model's TPR by selecting a more optimal cutoff value:
sens = c()
spec = c()
acc = c()
fpr = c()
cutoff = seq(0, 1, by=0.01)
for (i in 1:length(cutoff)){
newpred = factor(ifelse(model.elnet$pred$BT > cutoff[i], "BT", "NBT"), levels = c("BT", "NBT"))
sens[i] = sens_vec(model.elnet$pred$obs, newpred)
spec[i] = spec_vec(model.elnet$pred$obs, newpred)
acc[i] = accuracy_vec(model.elnet$pred$obs, newpred)
fpr[i] = 1 - spec[i]
}
cutoff_plot_data = data.frame(Cutoff = cutoff, Sensitivity = sens, Specificity = spec, Accuracy = acc, FPR = fpr)
p = ggplot(reshape2::melt(cutoff_plot_data, id.var='Cutoff')) +
geom_line(aes(x = Cutoff, y = value, color = variable)) +
scale_x_continuous(limits = c(0,1), breaks = seq(0,1,0.1)) +
labs(title = 'Elasticnet Regression Model Performance by Cutoff Value') +
xlab('Probability Cutoff Value')
ggplotly(p, dynamicTicks = TRUE) %>% rangeslider() %>% layout(hovermode = 'x')
A cutoff of 0.05 was selected, as TPR did not increase significantly at lower cutoff values. The associated confusion matrix for the cutoff of 0.05 is shown below:
newpred = factor(ifelse(model.elnet$pred$BT > 0.05, "BT", "NBT"), levels = c("BT", "NBT"))
plot_preds = cbind(plot_preds, newpred)
elnet.cm = confusionMatrix(newpred, model.elnet$pred$obs)
elnet.cm
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction BT NBT
#> BT 17095 24445
#> NBT 1103 526526
#>
#> Accuracy : 0.9551
#> 95% CI : (0.9546, 0.9556)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.5524
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.93939
#> Specificity : 0.95563
#> Pos Pred Value : 0.41153
#> Neg Pred Value : 0.99791
#> Prevalence : 0.03197
#> Detection Rate : 0.03004
#> Detection Prevalence : 0.07298
#> Balanced Accuracy : 0.94751
#>
#> 'Positive' Class : BT
#>
Decreasing the cutoff value to 0.05 increases the model's TPR to 0.939. The overall performance suffers in this case, however, with accuracy falling to 0.955; this trade-off was accepted to ensure that as many people in need as possible were reached. A summary table of the model results can be found below:
elnet.results = data.frame(Model = 'Elasticnet', Tuning = 'alpha = 1, lambda = 8.32e-05', AUROC = round(calc_auc(r)$AUC, 4), Threshold = 0.05, Accuracy = round(elnet.cm$overall[[1]],4), TPR = round(elnet.cm$byClass[[1]],4), FNR = round(1-elnet.cm$byClass[[2]],4), Precision = round(elnet.cm$byClass[[5]],4))
knitr::kable(elnet.results)
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FNR | Precision |
|---|---|---|---|---|---|---|---|
| Elasticnet | alpha = 1, lambda = 8.32e-05 | 0.9919 | 0.05 | 0.9551 | 0.9394 | 0.0444 | 0.4115 |
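The low precision reported for the ElasticNet model follows directly from the confusion matrix counts shown earlier; at the 0.05 cutoff, false positives outnumber true positives:

```r
# Confusion matrix cells at the 0.05 cutoff (from the ElasticNet output above)
TP <- 17095; FP <- 24445

precision <- TP / (TP + FP)  # more false alarms than true detections
round(precision, 4)
#> [1] 0.4115
```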
The summary of the results from each of the model fits can be found below:
results = rbind(logistic.results, lda.results, qda.results, knn.results, elnet.results)
knitr::kable(results)
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
|---|---|---|---|---|---|---|---|
| Logistic Regression | NA | 0.9985 | 0.1 | 0.9908 | 0.9654 | 0.0083 | 0.7929 |
| LDA | NA | 0.9888 | 0.1 | 0.9811 | 0.8600 | 0.0149 | 0.6562 |
| QDA | NA | 0.9982 | 0.1 | 0.9896 | 0.8991 | 0.0074 | 0.7995 |
| KNN | k=5 | NA | NA | 0.9971 | 0.9567 | 0.0015 | 0.9536 |
| Elasticnet | alpha = 1, lambda = 8.32e-05 | 0.9919 | 0.05 | 0.9551 | 0.9394 | 0.0444 | 0.4115 |
Each result is the average over a 10-fold cross-validation of the data, evaluated at the selected cutoff value.
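The 10-fold procedure can be illustrated in base R with a toy logistic fit standing in for the project's models; caret's `trainControl(method = "cv", number = 10)` automates the same loop.

```r
# Sketch of 10-fold cross-validation: fit on 9 folds, score on the held-out
# fold, and average the held-out accuracies. Toy data, not the project's.
set.seed(42)
dat <- data.frame(x = rnorm(200))
dat$y <- rbinom(200, 1, plogis(2 * dat$x))

k <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))  # random fold assignment

fold_acc <- sapply(1:k, function(i) {
  fit  <- glm(y ~ x, family = binomial, data = dat[folds != i, ])
  prob <- predict(fit, newdata = dat[folds == i, ], type = "response")
  mean((prob > 0.5) == dat$y[folds == i])          # held-out accuracy
})

mean(fold_acc)  # cross-validated accuracy estimate
```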
Out of the 5 models tested, K-Nearest Neighbors with \(k = 5\) appears to be the best performing model. This is not surprising, as the 3D scatterplot generated during the exploratory data analysis indicated that the decision boundary may be highly non-linear. The flexible KNN model created a highly non-linear decision boundary which allowed the model to predict the presence of blue tarps very accurately. A visualization of the model predictions is shown below:
preds = predict(model.knn, newdata = df)
classification = preds == df$collapse.Class
classification = data.frame(preds, classification)
# label each observation as TP/FP/TN/FN from its prediction and correctness
classification = ifelse(classification$preds == 'BT',
                        ifelse(classification$classification == TRUE, 'TP', 'FP'),
                        ifelse(classification$classification == TRUE, 'TN', 'FN'))
classification = as.factor(classification)
fig <- plot_ly(df, x = ~Red, y = ~Green, z = ~Blue, color = ~classification, colors = c('cornflowerblue', 'tomato', '#BF382A','#0C4B8E'))
fig <- fig %>% add_markers()
fig <- fig %>% layout(title = 'Distribution of Predictions in 3D Parameter Space',
scene = list(xaxis = list(title = 'Red'),
yaxis = list(title = 'Green'),
zaxis = list(title = 'Blue')))
fig
The traditional drawback of a flexible model such as KNN is its high variance. However, because this dataset contains a large number of observations, the KNN model's 10-fold cross-validation performance was excellent. Had a smaller dataset been used to train each model, KNN might have shown larger cross-validation error due to its naturally higher variance. Overall, there can be a moderately high level of confidence in the cross-validation results, as the dataset is very large relative to the number of variables and classes to predict.
The selected model, K-Nearest Neighbors, would be very helpful to humanitarian relief efforts for two reasons. First, the KNN model has a very high TPR of 95.67%, meaning very few areas that actually contain blue tarps would be missed; its precision of 95.36% also means that when the model flags an area, the flag can be trusted. Second, the model has an overall accuracy of over 99%, so aid services could rely on it when deciding where to deploy resources. High accuracy is important because false positives and false negatives both waste precious time and resources that could otherwise be deployed to help people in need.
To further improve this model's performance and utility, several options could be explored in future projects. First, a supplemental algorithm could be developed to detect when a prediction falls sufficiently close to the decision boundary and flag it for further review by humans.
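A minimal sketch of such a review flag, assuming the model's predicted BT probabilities are available as a numeric vector (`prob` here is illustrative):

```r
# Flag predictions whose probability falls within a band around the cutoff,
# so borderline pixels can be routed to a human reviewer.
flag_for_review <- function(prob, cutoff = 0.05, band = 0.02) {
  abs(prob - cutoff) < band
}

prob <- c(0.60, 0.06, 0.049, 0.001)
flag_for_review(prob)
#> [1] FALSE  TRUE  TRUE FALSE
```

The band width would need tuning against reviewer capacity: a wider band catches more uncertain pixels but generates more manual work.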
Next, additional data on latitude and longitude locations of each of the data points could also allow for algorithms to identify clusters of blue tarps in physical space. The addition of location data would allow aid services to deploy their resources even more effectively.
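As a sketch of that idea, assuming hypothetical `lat` and `lon` columns (not present in the supplied data), `stats::kmeans` could group predicted blue-tarp pixels into candidate deployment sites:

```r
# Sketch: cluster hypothetical blue-tarp coordinates into candidate camp
# sites. The coordinates below are simulated, not real survey data.
set.seed(7)
tarp_locs <- data.frame(
  lon = c(rnorm(30, -72.34, 0.01), rnorm(30, -72.30, 0.01)),
  lat = c(rnorm(30,  18.54, 0.01), rnorm(30,  18.57, 0.01))
)

clusters <- kmeans(tarp_locs, centers = 2, nstart = 10)
clusters$centers          # candidate deployment coordinates
table(clusters$cluster)   # tarps per cluster
```

In practice the number of clusters would not be known in advance, so a density-based method (e.g. DBSCAN) or an elbow/silhouette search over `centers` would be more appropriate than a fixed `centers = 2`.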
Finally, data pre-processing and feature engineering could be explored to further increase the true positive rate. Upsampling methods such as SMOTE could increase the prevalence of the blue tarp class in the training data, potentially improving the model's predictive power for that class. Likewise, downsampling techniques could balance the data by removing some of the NBT observations from the training set.
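The downsampling idea can be sketched in base R on a toy frame; SMOTE itself requires an additional package (caret exposes it through the `sampling` argument of `trainControl`), so only the simpler balancing step is shown here.

```r
# Sketch of downsampling: keep all minority-class (BT) rows and a random
# subset of majority-class (NBT) rows of equal size. Toy data, not the
# project's pixel data.
set.seed(2021)
toy <- data.frame(
  Class = factor(c(rep("BT", 20), rep("NBT", 500)), levels = c("BT", "NBT")),
  Red   = sample(0:255, 520, replace = TRUE)
)

bt  <- toy[toy$Class == "BT", ]
nbt <- toy[toy$Class == "NBT", ]
balanced <- rbind(bt, nbt[sample(nrow(nbt), nrow(bt)), ])

table(balanced$Class)
```

Downsampling discards information from the majority class, so on a dataset this large it is usually combined with cross-validation to check that predictive performance on NBT pixels does not degrade.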
While training and evaluating each of these models, the run time of each algorithm was also observed. The following observations were made:
While KNN was the best performing model for this application, it was also the slowest to train, with a significantly longer run time than any of the other models. If there were significantly more data points, it would be prudent to split the data into training and test sets, perform the model fit and cross validation on the training set, and then evaluate the final model performance using the test set.
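A stratified split of that kind can be sketched in base R (caret's `createDataPartition` performs the same stratified sampling directly); the data frame below is a toy stand-in for the pixel data.

```r
# Sketch: stratified 80/20 training/test split that preserves the BT/NBT
# prevalence, so the rare BT class appears in both partitions.
set.seed(10)
toy <- data.frame(
  Class = factor(sample(c("BT", "NBT"), 1000, replace = TRUE,
                        prob = c(0.03, 0.97))),
  Red   = sample(0:255, 1000, replace = TRUE)
)

# sample 80% of the row indices within each class
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$Class),
                           function(ix) sample(ix, floor(0.8 * length(ix)))))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

nrow(train); nrow(test)
```

Cross-validation and tuning would then run only on `train`, with `test` scored exactly once for the final performance estimate.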
The run time of the ElasticNet logistic regression model was also somewhat longer than that of the normal logistic regression model, LDA, and QDA models. For larger data sets it may also be appropriate to perform the training-test split on this model as well.
The run time of the Logistic Regression, LDA, and QDA models were similar and much smaller than that of the KNN or ElasticNet models. This property could make them useful in situations where aid resources had a larger data set to process and limited access to large computing resources.
This dataset is a very good candidate for predictive modeling for three reasons. First, it contains only features that are important for prediction, making the task of feature selection easy. Second, every feature shares the same scale (0-255 for RGB values), removing the need for some pre-processing steps. Finally, the dataset is large compared to the number of features in each model, which reduces the variance of the final models.
This project was completed in the following environment:
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] plotROC_2.2.1 plotly_4.9.4.1 class_7.3-19 yardstick_0.0.8
#> [5] broom_0.7.8 caret_6.0-88 lattice_0.20-44 GGally_2.1.2
#> [9] skimr_2.1.3 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.6
#> [13] purrr_0.3.4 readr_1.4.0 tidyr_1.1.3 tibble_3.1.2
#> [17] ggplot2_3.3.5 tidyverse_1.3.1
#>
#> loaded via a namespace (and not attached):
#> [1] nlme_3.1-152 fs_1.5.0 lubridate_1.7.10
#> [4] RColorBrewer_1.1-2 httr_1.4.2 repr_1.1.3
#> [7] tools_4.1.0 backports_1.2.1 bslib_0.2.5.1
#> [10] utf8_1.2.1 R6_2.5.0 rpart_4.1-15
#> [13] lazyeval_0.2.2 DBI_1.1.1 colorspace_2.0-1
#> [16] nnet_7.3-16 withr_2.4.2 tidyselect_1.1.1
#> [19] compiler_4.1.0 glmnet_4.1-1 cli_2.5.0
#> [22] rvest_1.0.0 xml2_1.3.2 labeling_0.4.2
#> [25] sass_0.4.0 scales_1.1.1 proxy_0.4-26
#> [28] digest_0.6.27 rmarkdown_2.9 base64enc_0.1-3
#> [31] pkgconfig_2.0.3 htmltools_0.5.1.1 highr_0.9
#> [34] dbplyr_2.1.1 htmlwidgets_1.5.3 rlang_0.4.11
#> [37] readxl_1.3.1 rstudioapi_0.13 shape_1.4.6
#> [40] farver_2.1.0 jquerylib_0.1.4 generics_0.1.0
#> [43] jsonlite_1.7.2 crosstalk_1.1.1 ModelMetrics_1.2.2.2
#> [46] magrittr_2.0.1 Matrix_1.3-3 Rcpp_1.0.6
#> [49] munsell_0.5.0 fansi_0.5.0 lifecycle_1.0.0
#> [52] stringi_1.6.2 pROC_1.17.0.1 yaml_2.2.1
#> [55] MASS_7.3-54 plyr_1.8.6 recipes_0.1.16
#> [58] grid_4.1.0 crayon_1.4.1 haven_2.4.1
#> [61] splines_4.1.0 hms_1.1.0 knitr_1.33
#> [64] pillar_1.6.1 reshape2_1.4.4 codetools_0.2-18
#> [67] stats4_4.1.0 reprex_2.0.0 glue_1.4.2
#> [70] evaluate_0.14 data.table_1.14.0 modelr_0.1.8
#> [73] vctrs_0.3.8 foreach_1.5.1 cellranger_1.1.0
#> [76] gtable_0.3.0 reshape_0.8.8 assertthat_0.2.1
#> [79] xfun_0.24 gower_0.2.2 prodlim_2019.11.13
#> [82] e1071_1.7-7 viridisLite_0.4.0 survival_3.2-11
#> [85] timeDate_3043.102 iterators_1.0.13 lava_1.6.9
#> [88] ellipsis_0.3.2 ipred_0.9-11